I am inheriting many formulas in this workbook SYS 6018 lab file and in the textbook An Introduction to Statistical Learning. (James, 2013), from live session (live.rmd) files provided for the SYS6018 course developed by Professor Scott Schwartz, who I thank and credit for feedback to steer and inform the myriad of the approaches you will see in this analysis. I also well as from general resource documentation provided by RStudio in https://rstudio.com/resources/cheatsheets/. All other programming is my own, though I have developed working knowledge from many online forums. I may also use online solutions to verify solutions as is permitted for the course. I will cite those sources throughout.
I am using functions from the following packages for this assignment, most of which have been discussed or used in the course already:
library(leaflet)
library(PRROC)
library(tidyverse)
library(reactable)
library(broom)
library(forcats)
library(caret)
library(stats)
library(randomForest)
library(car)
library(ggplot2)
library(plotly)
library(leaps)
library(gridExtra)
library(mapview)
library(lwgeom)
library(sf)
library(rgeos)
library(pls)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggspatial)
library(doMC)
registerDoMC(cores=9)Following 2010 Haiti Earthquake link “it was known that the people whose homes had been destroyed by the earthquake were creating temporary shelters using blue tarps, and these blue tarps would be good indicators of where the displaced persons were – if only they could be located in time, out of the thousands of images that would be collected every day. The problem was that there was no way for aid workers to search the thousands of images in time to find the tarps and communicate the locations back to the rescue workers on the ground in time. The solution would be provided by data-mining algorithms, which could search the images far faster and more thoroughly (and accurately?) then humanly possible. The goal was to find an algorithm that could effectively search the images in order to locate displaced persons and communicate those locations rescue workers so they could help those who needed it in time.”
The goal of this project is therefore to “Use data from the actual data collection process was carried out over Haiti” and “test each of the algorithms [learned in the course] on the imagery data collected during the relief efforts made Haiti in response to the 2010 earthquake, and determine which method you will use to as accurately as possible, and in as timely a manner as possible, locate as many of the displaced persons identified in the imagery data so that they can be provided food and water before their situations become unsurvivable.”
I also set out a couple of personal goals as I completed this project. The first was to push myself to exploit the R library “caret” (classification and regression training), which is known best for its robust train function which handles model training, feature selection, and optimization. I use this approach because the goal is to try several different modeling techniques. The second was to hone my skills reading in and visualizing data.
My general approach will be to treat “part 1” (performed earlier in the course) as a lab exercise for statistical modeling techniques. Therefore, conclusions I make about my model performance along F_Beta score (the metric I use to justify which method/technique is the best) will not provide any simulation of a real event. I will use part 2 as a simulation of the receipt of a real world dataset as well as an exercise in risk management. My assumption is that producing an understandable model is somewhat important, but even more important was to demonstrate that this model performs well on data it has never seen before (i.e., simulating the real world emergency) using a variety of techniques.
Related, an important assumption I made is that working with a sample of a larger preliminary dataset in the lab would not have a major effect on the result. I did so for the sake of time while experimenting with different parameters. In the second part, I did retrain and refit my models on the hold out set to be sure the models fit to the whole dataset were similarly summarized as those fit to just the sample. This was always the case.
My conclusion is that my lab setting with more limited parameters did not emulate the real world event very well (I found inconsistent results for certain methods), but the model I produced was robust enough to perform well. The performance of these techniques (and improvement from lab to real) are demonstrated throughout. I found that a more traditional technique, binary logistic regression, which aimed to categorize the images by taking the log-odds that they will be in either one or the other class, performs just as well, if not better, than more modern machine learning techniques. My hypothesis that the top performer would be linear discriminate analysis, which aims to define a linear decision boundary between images from a color-based, was not correct. This goes to show that considering the problem statement itself is very important, but validating assumptions is even more so.
In the prompt for part 1, I was asked to: * (a) provide a thresholds used * (b) explain and justify rationale [for those thresholds]
In this classification problem, people’s lives are at stake. I want tests to be as accurate as possible and I want the relief delivery to be as efficient as possible. The latter part of that statement is important, because if time were no object, I could err on the side of delivering all of the resources (which are probably also finite) to everyone whether they needed them or not. But that isn’t the case.
In other words, in addition to high recall, I also need to look at precision to some degree; if I am delivering items to non-tarps just because i can, then I might be wasting a lot of time. So this becomes a threshold problem where I am focused on both recall (finding blue t) and precision (reducing false discovery for new predictions). Looking at this trade-off in terms of meaningful diagnostic or score will be the way I will approach setting my threshold in Step 7.
When I produced part 1 of this analysis, I used non-tarp as the positive class. Professor Schwartz clarified with me that this was a blunder, not because one absolutely must relevel classes, but because in a prediction context like this one, it changes what is predicted; the task I was inadvertently performing in part 1 was maximizing recall of non-tarps. This is not the same as solving the problem of identifying blue tarps, because the threshold established for scoring was using the much larger non-tarp class. It is important that the models perform well in solving the much different problem of recalling (and doing so precisely) with a small subset of blue tarps from the larger pool of non-tarps. Fortunately, I found that the same model performs very well for that purpose after releveling my classes.
I used the “reactable” package to apply some sorting and aggregation.
haiti<-read.csv('HaitiPixels.csv',header=TRUE,sep=",")
dim(haiti)
#> [1] 63241 4
sapply(haiti, function(x) sum(is.na(x)))
#> Class Red Green Blue
#> 0 0 0 0
#https://glin.github.io/reactable/articles/examples.html#multiple-groups-1
reactable(haiti, groupBy = "Class", columns = list(
Red = colDef(aggregate = "mean"), Green = colDef(aggregate = "mean"),Blue = colDef(aggregate = "mean")))I can see from the results that: * There are no missing records or N/A values in the set. * There are 4,744 “Various Non-Tarp” and 9,903 rooftops that will create a challenge because of how close the red and green values are to the respective values for blue tarps. * There aren’t many images classified as blue tarps. 2,022 out of ~63,200 or ~3.2%. Thinking about this critically, this is a more limited dataset than I want to be working with when it comes time for cross-validation using test/train sets. Therefore, a high k-folds (10) seems appropriate to reduce sampling variability. * The Blue value for blue is DISTINCTLY HIGHER for blue tarp than other classes. This makes sense, but is it enough to think about dropping the red and green features from the model? As we will show, probably not.
I am going to create a test dataset to use for testing in Step 8 using caret’s “createDataPartition” function. I don’t really need to do this here, because
set.seed(2) #set random seed
haiti <- tibble::rowid_to_column(haiti, "ID") %>% mutate(binary=ifelse(Class=="Blue Tarp","Blue.Tarp","NonTarp"))
haiti$binary<-relevel(as.factor(haiti$binary),ref="NonTarp") #releveling with Blue Tarp as the positive class
contrasts(haiti$binary)
#> Blue.Tarp
#> NonTarp 0
#> Blue.Tarp 1
haiti_test <- createDataPartition(y=haiti$binary, p=0.2, list=FALSE) #I am keeping 80% of data in the dataset because I want really good training, but I want enough to have a couple hundred blue tarps in the test set.
haiti.train <- haiti[-haiti_test,]
haiti.test<-haiti[haiti_test,]In the chunk above I also releveled my factor classes to that the reference class was “NonTarp” and the positive class was BlueTarp.
Now I will check to make sure my training set was pretty reflective of the original set.
In this step, I am going to examine a basic linear model and also try to confirm that I cannot get rid of any predictors.
Is there multicollinearity in the basic model that looks at red, green and blue?
basicmod<-glm(binary~Red+Blue+Green,data=haiti,family=binomial)
vif(basicmod)
#> Red Blue Green
#> 224.0059 371.0834 259.5066
pairs(haiti[,3:5],lower.panel=NULL) Of course. These values are derived as complements to each other, to describe/replicate the color of the images. But given the array of images I’m dealing with, do I need all three of them? Do they yield useful information in combination? I should explore this by looking at the interaction between features to model the presence of a blue tarp.
mod2<-glm(binary~(Red+Green+Blue)^2,data=haiti,family=binomial)
mod2.summary <- mod2 %>% broom::tidy()
mod2.summary
#> # A tibble: 7 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -11.5 0.886 -13.0 1.16e-38
#> 2 Red 0.0564 0.0368 1.53 1.25e- 1
#> 3 Green -0.190 0.0502 -3.79 1.52e- 4
#> 4 Blue 0.291 0.0459 6.33 2.38e-10
#> 5 Red:Green -0.00111 0.000358 -3.11 1.85e- 3
#> 6 Red:Blue -0.000620 0.000334 -1.86 6.33e- 2
#> 7 Green:Blue 0.00125 0.000233 5.37 7.75e- 8
summary(mod2) #base r summary shows p-values with significance levels
#>
#> Call:
#> glm(formula = binary ~ (Red + Green + Blue)^2, family = binomial,
#> data = haiti)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.3660 -0.0271 -0.0071 0.0000 2.9738
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.152e+01 8.856e-01 -13.004 < 2e-16 ***
#> Red 5.639e-02 3.676e-02 1.534 0.125018
#> Green -1.903e-01 5.025e-02 -3.787 0.000152 ***
#> Blue 2.911e-01 4.595e-02 6.335 2.38e-10 ***
#> Red:Green -1.114e-03 3.577e-04 -3.114 0.001845 **
#> Red:Blue -6.199e-04 3.338e-04 -1.857 0.063287 .
#> Green:Blue 1.254e-03 2.334e-04 5.373 7.75e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 17901.6 on 63240 degrees of freedom
#> Residual deviance: 1327.7 on 63234 degrees of freedom
#> AIC: 1341.7
#>
#> Number of Fisher Scoring iterations: 13It seems like I can probably get rid of Red and the interaction between Red and Blue. I will use this as the starting point to a regsubsets selection below.
mod3<-binary~(Green+Blue+Red*Green+Green*Blue)
regfit.full <- regsubsets(mod3, data=haiti)
reg.summary <- regfit.full %>% broom::tidy()
reg.summary %>% dplyr::select(r.squared)
#> # A tibble: 5 x 1
#> r.squared
#> <dbl>
#> 1 0.0559
#> 2 0.412
#> 3 0.445
#> 4 0.465
#> 5 0.465
RSSs <- tibble(RSS=regfit.full$rss) %>%
ggplot(aes(x=seq_along(RSS), y=RSS)) + geom_line() +
xlab("Number of Variables")
adjR2s <- reg.summary %>%
ggplot(aes(x=seq_along(adj.r.squared), y=adj.r.squared)) + geom_line() +
geom_point(reg.summary, color='red', size=5,
mapping=aes(x=which.max(adj.r.squared), y=max(adj.r.squared))) +
xlab("Number of Variables")
mallows_cps <- reg.summary %>%
ggplot(aes(x=seq_along(mallows_cp), y=mallows_cp)) + geom_line() +
geom_point(reg.summary, color='red', size=5,
mapping=aes(x=which.min(mallows_cp), y=min(mallows_cp))) +
xlab("Number of Variables")
BICs <- reg.summary %>%
ggplot(aes(x=seq_along(BIC), y=BIC)) + geom_line() +
geom_point(reg.summary, color='red', size=5,
mapping=aes(x=which.min(BIC), y=min(BIC))) +
xlab("Number of Variables")
grid.arrange(RSSs, adjR2s, mallows_cps, BICs, nrow=2) #3 degree modelcoef(regfit.full,4)
#> (Intercept) Blue Red Green:Red Green:Blue
#> 8.230526e-01 1.450118e-02 -8.241326e-03 2.429649e-05 -4.282234e-05
#Forward Selection
regfit.fwd <- regsubsets(mod3,data=haiti, method="forward")
RSSsForward <- tibble(RSS=regfit.fwd$rss) %>% #residual sum of squares - pick lowest
ggplot(aes(x=seq_along(RSS), y=RSS)) + geom_line() +
xlab("Number of Variables")
regfit.fwd <- regfit.fwd %>% broom::tidy()
adjR2sForward <- regfit.fwd %>%
ggplot(aes(x=seq_along(adj.r.squared), y=adj.r.squared)) + geom_line() +
geom_point(regfit.fwd, color='red', size=5,
mapping=aes(x=which.max(adj.r.squared), y=max(adj.r.squared))) +
xlab("Number of Variables - Forward Selection")
mallows_cpsForward <- regfit.fwd %>%
ggplot(aes(x=seq_along(mallows_cp), y=mallows_cp)) + geom_line() +
geom_point(regfit.fwd, color='red', size=5,
mapping=aes(x=which.min(mallows_cp), y=min(mallows_cp))) +
xlab("Number of Variables - Forward Selection")
BICsForward <- regfit.fwd %>%
ggplot(aes(x=seq_along(BIC), y=BIC)) + geom_line() +
geom_point(regfit.fwd, color='red', size=5,
mapping=aes(x=which.min(BIC), y=min(BIC))) +
xlab("Number of Variables - Forward Selection")
grid.arrange(RSSsForward, adjR2sForward, mallows_cpsForward, BICsForward, nrow=2)
#Backwards Elimination
regfit.bwd <- regsubsets(mod3, data=haiti, method="backward")
RSSsBackwards <- tibble(RSS=regfit.bwd$rss) %>% #residual sum of squares - pick lowest
ggplot(aes(x=seq_along(RSS), y=RSS)) + geom_line() +
xlab("Number of Variables")
regfit.bwd <- regfit.bwd %>% broom::tidy()
adjR2sBackwards <- regfit.bwd %>%
ggplot(aes(x=seq_along(adj.r.squared), y=adj.r.squared)) + geom_line() +
geom_point(regfit.bwd, color='red', size=5,
mapping=aes(x=which.max(adj.r.squared), y=max(adj.r.squared))) +
xlab("Number of Variables - Backwards Elimination")
mallows_cpsBackwards <- regfit.bwd %>%
ggplot(aes(x=seq_along(mallows_cp), y=mallows_cp)) + geom_line() +
geom_point(regfit.bwd, color='red', size=5,
mapping=aes(x=which.min(mallows_cp), y=min(mallows_cp))) +
xlab("Number of Variables - Backwards Elimination")
BICsBackwards <- regfit.bwd %>%
ggplot(aes(x=seq_along(BIC), y=BIC)) + geom_line() +
geom_point(regfit.bwd, color='red', size=5,
mapping=aes(x=which.min(BIC), y=min(BIC))) +
xlab("Number of Variables - Backwards Elimination")
grid.arrange(RSSsBackwards, adjR2sBackwards, mallows_cpsBackwards, BICsBackwards, nrow=2) I note from the exhibit above that Green:Blue, the fifth feature added (or first candidate for removal in backward selection), is not really improving any of the selection criteria (Adj R^2, Mallows CP, and BIC stays flat). I also tried this with AIC and got the same result. I might consider removing the Green:Blue term from the model, but since I am looking for the highest degree of accuracy possible, for now I will leave it in.
I am not sure that I need to do any model transformations here. I have a fairly interpretable model and I do not have to worry about things like heteroskedasticity (since this is binary logistic regression and we know there are imbalanced classes). One thing I might like to explore with more time would be QQ-plots of the data after I put it through the logit function.
The results from my “selection methods” may or may not been improved by using ridge/lasso with a training dataset and k-folds. I tried this method below, but it was not clear that the data needed to be regularized.
#set.seed(9)
# https://www.datacareer.ch/blog/ridge-and-lasso-in-r/
#ridgeCV <- caret::train(mod3, data=haiti,
# method="glmnet",
# tuneGrid=data.frame(alpha=rep(0,50), #for alpha = 0 to 100
# lambda=10^seq(4,0,length=100)), #select 100 values of lambda from 10^4 to 10^0
# trControl=trainControl("cv", number=10, returnResamp='all',allowParallel=TRUE))
#coef(ridgeCV$finalModel, ridgeCV$bestTune$lambda) %>% length() #it has all coefficients
#lassoCV1 <- caret::train(mod3, data=haiti.train,
# method="glmnet",
# tuneGrid=data.frame(alpha=rep(1,100),
# lambda=10^seq(4,-2,length=100)),
# trControl=trainControl("cv", number=10, returnResamp='all',allowParallel=TRUE))
#coef(lassoCV1$finalModel, lassoCV1$bestTune$lambda) %>% length() #it has all coefficientsBy this step I have a model that I feel fairly confident in. But before approaching the more specific challenges of determining which statistical modeling technique is most accurate or most robust, I want to try and develop a sense of where to begin in applying the thresholds. I already know what I want out of the model (high recall and relatively high precision), but I would like to quickly visualize false positives and negatives using a few models.
For the k-nearest neighbors (knn) method, I need to first find a good value for k. I can try different models out with a varying number of nearest neighbors.
At first I ran KNN with k=1:50. Fortunately, it seemed like 50 was around when the algorithm started to produce the same result (the limit of improvement when adding more neighbors). The best value that caret chose for the model was 8, but I notice that this is even-numbered. For a binary classification problem, this indicates that the value was at least partially derived by breaking ties when there were equidistant nearest-neighbors. We might not be able to rely on how this k will behave moving forward for that reason (though here is a great discussion from a data scientist with alternate ideas for addressing this problem).
Larger values of k were also not better performers of small values as far as I could tell. So I looked at a smaller section in the code below and used only odd numbers.
set.seed(1001) #for synchronizing bootstrap resampling precisely across different methods
classifier.knn<-caret::train(mod3, data=haiti, method="knn",
preProcess=c("center","scale"),
tuneGrid=data.frame(k=seq(1,30,2)),
trControl = caret::trainControl("cv", number=10,
returnResamp='all', savePredictions='final',
classProbs=TRUE, allowParallel=TRUE)) #allow parallel = true
classifier.knn
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NonTarp', 'Blue.Tarp'
#>
#> Pre-processing: centered (5), scaled (5)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56916, 56917, 56916, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 1 0.9966161 0.9450928
#> 3 0.9971538 0.9540177
#> 5 0.9972644 0.9559626
#> 7 0.9971063 0.9534288
#> 9 0.9972012 0.9550240
#> 11 0.9972486 0.9558389
#> 13 0.9971221 0.9537062
#> 15 0.9970747 0.9529113
#> 17 0.9970273 0.9522126
#> 19 0.9970431 0.9524196
#> 21 0.9970273 0.9521188
#> 23 0.9969798 0.9513384
#> 25 0.9969640 0.9510338
#> 27 0.9969482 0.9507544
#> 29 0.9969324 0.9504876
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.This time, k of 5 was chosen. Being a very low value this will save computation time when it comes time to act on the hold-out set. However, we might want to check the variability before making more assumptions.
With my initial KNN output I can also take a random value for the precision-recall trade-off (I chose 0.8), and assess that by looking at accuracy and sorting through the results in terms of false positives and false negatives.
classifier.knn$pred <- classifier.knn$pred%>%
dplyr::mutate(pred2 = ifelse(classifier.knn$pred$Blue.Tarp>0.8,'Blue.Tarp','NonTarp')) %>%
dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
cm_k<-classifier.knn$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=classifier.knn$pred$pred2,reference=classifier.knn$pred$obs,positive='Blue.Tarp',mode='prec_recall'))
reactable(classifier.knn$pred,groupBy='Resample',columns = list(k = colDef(aggregate = "mean")))
get_metrics_func(cm_k, 'accuracy',mean)
#> Error in get_metrics_func(cm_k, "accuracy", mean): could not find function "get_metrics_func"
get_metrics_func(cm_k, 'accuracy', sd)
#> Error in get_metrics_func(cm_k, "accuracy", sd): could not find function "get_metrics_func"I used “returnResamp=”all" in caret’s wrapper to return all predictions which allows me to check the standard deviation across folds and average all the fold predictions together, as demonstrated in the step above.
At this step I know I am going to need to create models that can be efficiently tuned and run. I also know that I’m going to have to run them and score them again at the end.
So in this section, I have developed and will run through functions which: 1. use a combination of k-folds validation, which creates 10 training and validation sets (within the current training set) 2. tabulate a confusion matrix for each method and averages metrics like % of tarps recalled, precision, and accuracy, 3. using various tuning parameters, create a precision-recall curve and returns the area under that curve, 4. score the outcome based on F_beta score, as discussed at the beginning of the document. I will do this in Section 7.
In this part I develop functions to fit traditional classification methods logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and k-nearest neighbors. The functions averages out-of-folds recall and precision terms for each of the 10 folds, so that I may reduce sampling variability before making conclusions about the overall accuracy of my model. K-folds provides the advantage of having multiple, cross-validated sets to work with and working with caret output allows me to produce averages among them.
beta<-5
set.seed(9)
sample<-sample_n(haiti,20000,replace=FALSE)
#2. Function to pull a metric like sensitivity or recall out of a caret confusion matrix object
get_metric <- function(caret_CM_object, metric){
caret_CM_object %>% broom::tidy() %>%
dplyr::filter(term==metric) %>%
dplyr::select(estimate) %>% pull()
}
#map the metric pull function across all CMs created by each fold
get_metrics_func <- function(caret_CM_objects, metric, func){
caret_CM_objects %>%
purrr::map( ~ get_metric(.x, metric )) %>%
unlist() %>% func
}
#3. Creates a precision recall curve
#https://www.rdocumentation.org/packages/PRROC/versions/1.3.1/topics/pr.curve
curve<-function(scorespositive,scoresnegative,method){
pr_AUC<- pr.curve(scorespositive,scoresnegative,curve=TRUE)
plot(pr_AUC)
}
#Logistic Regression logit function
#inverse_logit <- function(x) {
# 1 / (1 + exp(-x))
#}
#don't know if I need this - caret seems to do this automatically with "family=binomial"??
thresholdinglog<-function(threshold){
glm.fit2 <- caret::train(mod3, data=sample,
preProcess=c("center","scale"),
trControl = caret::trainControl("cv", number=2, #k-folds cross-validation
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE),method="glm",family="binomial")
#glm.logOdds_predictions<-glm.fit2$pred %>% mutate(.predicted_probability = inverse_logit(glm.fit2$pred$Blue.Tarp)) #not using log odds, caret seems to do this automatically
glm.fit2 <- glm.fit2$pred %>% mutate(.prediction=ifelse(glm.fit2$pred$Blue.Tarp>threshold,"Blue.Tarp","NonTarp")) %>% #logit of the threshold to match logit of the predictions
mutate(.predictions=fct_relevel(as_factor(.prediction), c('NonTarp', 'Blue.Tarp')))
confmat<-glm.fit2 %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=glm.fit2$.predictions,reference=glm.fit2$obs,positive='Blue.Tarp',mode='prec_recall'))
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
AUC<-curve(glm.fit2$Blue.Tarp<threshold,glm.fit2$Blue.Tarp>threshold)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,threshold)
metricframe
}
#K-Nearest Neighbors
thresholding_knn<-function(threshold){
glm.fit <- caret::train(mod3, data=sample, method="knn",
preProcess=c("center","scale"),
tuneGrid=data.frame(k=5),
trControl = caret::trainControl("cv", number=10, #k-folds cross-validation
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE))
glm.fit$pred <- glm.fit$pred%>%
dplyr::mutate(pred2 = ifelse(glm.fit$pred$Blue.Tarp>threshold,'Blue.Tarp','NonTarp')) %>%
mutate(.predictions=factor(pred2, c('Blue.Tarp', 'NonTarp')))
confmat<-glm.fit$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=glm.fit$pred$.predictions,reference=glm.fit$pred$obs,positive='Blue.Tarp',mode='prec_recall'))
Recall<-get_metrics_func(confmat, 'recall', mean)
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
AUC<-curve(glm.fit$pred$Blue.Tarp<threshold,glm.fit$pred$Blue.Tarp>threshold)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,threshold)
metricframe
}
#QDA and LDA
thresholding_lda <- function(threshold){
glm.fit3<-caret::train(mod3, data=sample, method="lda",
preProcess=c("center","scale"),
trControl=trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE)) #10 folds cross-validation
glm.fit3$pred <- glm.fit3$pred%>%
dplyr::mutate(pred2 = ifelse(glm.fit3$pred$Blue.Tarp>threshold,'Blue.Tarp','NonTarp')) %>%
dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
confmat<-glm.fit3$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=.x$pred2,reference=.x$obs,positive='Blue.Tarp',mode='prec_recall'))
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
AUC<-curve(glm.fit3$pred$Blue.Tarp<threshold,glm.fit3$pred$Blue.Tarp>threshold)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,threshold)
metricframe
}
thresholding_qda <- function(threshold){
glm.fit3<-caret::train(mod3, data=sample, method="qda",
preProcess=c("center","scale"),
trControl=trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE)) #10 folds cross-validation
glm.fit3$pred <- glm.fit3$pred%>%
dplyr::mutate(pred2 = ifelse(glm.fit3$pred$Blue.Tarp>threshold,'Blue.Tarp','NonTarp')) %>%
dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
confmat<-glm.fit3$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=.x$pred2,reference=.x$obs,positive='Blue.Tarp',mode='prec_recall'))
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
AUC<-curve(glm.fit3$pred$Blue.Tarp<threshold,glm.fit3$pred$Blue.Tarp>threshold)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,threshold)
metricframe
}In this part I fit models using support vector machines, which transform the model to higher dimensions using a kernel function, and random forests, which apply and average a series of decision trees based on model features and average the ensemble results before fitting the model.
I tried support vector machines using radial basis, linear, and polynomial kernels.
set.seed(9)
thresholdingsvm <- function(threshold){
caret.svm.linear <- caret::train(mod3,data=sample, method='svmLinear', scale=FALSE,
tuneGrid=data.frame(C=14),
trControl=caret::trainControl("cv", number=10,
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE))
caret.svm.linear$pred
caret.svm.linear$pred <- caret.svm.linear$pred %>%
dplyr::mutate(pred2 = ifelse(caret.svm.linear$pred$Blue.Tarp>threshold,'Blue.Tarp','NonTarp')) %>%
dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
confmat<-caret.svm.linear$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=as.factor(caret.svm.linear$pred$pred2),reference=as.factor(caret.svm.linear$pred$obs),positive='Blue.Tarp',mode='everything'))
confmat
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
AUC<-curve(caret.svm.linear$pred$Blue.Tarp<threshold,caret.svm.linear$pred$Blue.Tarp>threshold)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,threshold)
metricframe
#caret.svm.radial <- caret::train(mod3,data=sample, method='svmRadial', scale=FALSE,
# tuneGrid=data.frame(C=seq(1,10,2),sigma=seq(0.1,1,0.2)),
# trControl=caret::trainControl("cv", number=10,
# returnResamp='all', savePredictions='final',
# classProbs=TRUE,allowParallel=TRUE))
#caret.svm.radial$pred
# caret.svm.radial$pred <- caret.svm.radial$pred%>%
# dplyr::mutate(preds = ifelse(caret.svm.radial$pred$Blue.Tarp>threshold,'Blue.Tarp','NonTarp')) %>%
# dplyr::mutate(pred2 = factor(preds,levels=c('Blue.Tarp','NonTarp')))
#confmat<-caret.svm.radial$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=caret.svm.radial$pred$pred2,reference=caret.svm.radial$pred$obs,positive='Blue.Tarp',mode='prec_recall'))
#caret.svm.poly <- caret::train(mod3,data=sample, method='svmPoly', scale=FALSE,
# tuneGrid=data.frame(C=1,scale=1,degree=1),
# trControl=caret::trainControl("cv", number=10,allowParallel=TRUE))
#caret.svm.poly
# caret.svm.poly$pred <- caret.svm.poly$pred%>%
# dplyr::mutate(pred2 = ifelse(caret.svm.poly$pred>threshold,'Blue.Tarp','NonTarp')) %>%
# dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
#caret.svm.poly <- caret::train(mod3,data=sample, method='svmLinear', scale=FALSE,
# tuneGrid=data.frame(C=1,sigma=1),
# trControl=caret::trainControl("cv", number=10,
# returnResamp='all', savePredictions='final',
# classProbs=TRUE,allowParallel=TRUE))
#caret.svm.poly$pred
# caret.svm.poly$pred <- caret.svm.poly$pred%>%
# dplyr::mutate(preds = ifelse(caret.svm.poly$pred$Blue.Tarp>0.8,'Blue.Tarp','NonTarp')) %>%
# dplyr::mutate(pred2 = factor(preds,levels=c('Blue.Tarp','NonTarp')))
#confmat<-caret.svm.poly$pred %>% group_split(Resample) %>% #purrr::map(~caret::confusionMatrix(data=caret.svm.poly$pred$pred2,reference=caret.svm.poly$pred$obs,positive='Blue.Tarp',mode='prec_recall'))
}I will vary the cost parameter above in the next section.
Random forests construct decision trees and select the class that is the average prediction of individual trees. The tuning parameter that is important in this case is the number of trees selected.
set.seed(3)
thresholdingrf <- function(trees){
glm.fit <- caret::train(mod3, data=sample, method='rf',
importance=FALSE, ntree=15,
tuneGrid=data.frame(mtry=2), #all features are selected for bagging
trControl=caret::trainControl("cv", number=10,
savePredictions=TRUE,
returnResamp='all',classProbs=TRUE,allowParallel = TRUE))
glm.fit$pred <- glm.fit$pred%>%
dplyr::mutate(pred2 = ifelse(glm.fit$pred$Blue.Tarp>0.4,'Blue.Tarp','NonTarp')) %>%
dplyr::mutate(pred2 = factor(pred2,levels=c('Blue.Tarp','NonTarp')))
confmat<-glm.fit$pred %>% group_split(Resample) %>% purrr::map(~caret::confusionMatrix(data=.x$pred2,reference=.x$obs,positive='Blue.Tarp',mode='prec_recall'))
Recall<-get_metrics_func(confmat, 'recall', mean)
Stdev_Recall=get_metrics_func(confmat, 'recall', sd)
Precision<-get_metrics_func(confmat, 'precision', mean)
Stdev_Precision<-get_metrics_func(confmat, 'precision', sd)
Specificity<-get_metrics_func(confmat, 'specificity', mean)
Stdev_Spec<-get_metrics_func(confmat, 'specificity', sd)
FDR<-1-Precision
#curve(glm.fit$pred$Blue.Tarp<0.4,glm.fit$pred$Blue.Tarp>0.4)
AUC<-curve(glm.fit$pred$Blue.Tarp<0.4,glm.fit$pred$Blue.Tarp>0.4)
Score<-((beta+1)^2)*(Precision*Recall)/(beta^2*Precision)+Recall*(Precision/Recall)*(beta)
Score_Dev<-((beta+1)^2)*((Stdev_Precision)*Stdev_Recall)/(beta^2*(Stdev_Precision))+Stdev_Recall*(Stdev_Precision)/(Stdev_Recall)*(beta)
metricframe<-data.frame(Recall,Stdev_Recall,Precision,Stdev_Precision,'Accuracy'=get_metrics_func(confmat, 'accuracy', mean),'Stdev_Accuracy'=get_metrics_func(confmat, 'accuracy', sd),Specificity,Stdev_Spec,FDR,Score,Score_Dev,'Threshold'=0.4)
metricframe
}The final model from running random forest should also be reviewed from a feature importance standpoint, to consider whether features could be removed (treating m as a tuning parameter).
# https://stats.stackexchange.com/questions/162465/in-a-random-forest-is-larger-incmse-better-or-worse
# http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
# https://ggplot2.tidyverse.org/reference/geom_bar.html
set.seed(19)
glm.fit <- caret::train(mod3, data=sample, method='rf',
importance=FALSE, ntree=15,
tuneGrid=data.frame(mtry=2), #all features are selected for bagging
trControl=caret::trainControl("cv", number=10,
savePredictions=TRUE,
returnResamp='all',classProbs=TRUE,allowParallel = TRUE))
importance1 <- sort(randomForest::importance(glm.fit$finalModel)[,])
importance1 %>% bind_rows() %>% tidyr::pivot_longer(cols=names(importance1)) %>%
dplyr::rename(`relative importance`=value) %>%dplyr::mutate(name = factor(name, levels=name)) %>%
ggplot(aes(y=name, x=`relative importance`)) + geom_bar(stat='identity', orientation="y") I can see that blue is the most important feature, not surprisingly. It is interesting that Green on its own seems to matter relatively little in the decision. However, if removed from the model, the cross-validation test error increased a bit and since I am looking for a high degree of accuracy, I will leave in. I will discuss below what varying “mtry” (number of features used at each model) did for my model.
I will iterate through precision-recall trade-off thresholds between 0 and 1 in an attempt to find the highest accuracy (fewest errors) as possible. Since k-folds validation is already handling the train/test framework, I can do this using the full preliminary dataset.
#https://purrr.tidyverse.org/reference/map2.html
#set.seed(11)
#map(seq(0,0.99,0.09),thresholding_lda) #lda
#map(seq(0,0.99,0.09),thresholding_qda) #qda#
#map(seq(0,0.99,0.09),thresholdinglog) #logistic
#map(seq(0,0.99,0.09),thresholding_knn) #knn
#seq(0.81,0.99,0.03) %>% map(thresholdingsvm)
#seq(0,0.99,0.09) %>% map(thresholdingrf)
#seq(700,1100,50) %>% map(thresholdingrf) #random forest iterated over several decision trees with a value of 4; best value here was 0.9966 at n=900Findings Lower thresholds generally produced higher accuracy; average accuracy peaks at thresholds 0.2 and 0.4 precision-recall threshold, with the exception of QDA.
Traditional Classifiers Logistic Regression: Very low variability with a threshold of around 0.11, but highest accuracy was achieved at 0.2. Highest accuracy achieved was around 0.996. LDA:: The highest accuracy is found around 0.2. Highest accuracy achieved was around 0.993. QDA: The best highest accuracy is found at a threshold of 0.99. Highest accuracy achieved was around 0.996. KNN: The best score is found around 0.4, accuracy achieved was 0.996.
Random Forest The highest accuracy is found with a threshold of around 0.4. This seemed to be consistently true regardless of the number of decision trees. After holding this value constant I optimized the number of decision trees used.
Initially it seemed to be that around 1000 trees yielded the highest accuracy. But this is obviously not ideal in terms of an interpretable model. When I looked at values below 50 (my original starting point), I was surprised to find that many values of n (number of trees) performed just as well in terms of accuracy. In fact, a random forest with just two trees performed well when considering standard deviation in accuracy (accuracy = 0.996; stdev = 0.0008). However, this model might be risky to take forward on a different dataset due to bias. Therefore, I am selecting a model with n=15 decision trees which has comparably high accuracy values (0.9965; stdev=0.0011).
I also tried different values of “mtry”, the number of parameters used at each split for classification. I was able to get a more interpretable model by just using 2 parameters at each decision. This means I have a reasonably interpretable model and does not present as much of a problem for bias considering I will have the same features in my hold-out dataset.
Support Vector Machines Support vector machines have different parameters depending on their type. I will discuss the tuning parameters I selected below and how to move forward with the results.
Linear kernel: The linear kernel “the linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation” (James, 2013). What this parameter does is defines the flexibility of the model, i.e., a higher C penalizes misclassification error source. The tradeoff is getting a smooth decision surface, which may yield better results with the much larger holdout dataset. I will want to take a look at how well this parameter behaves in the final holdout set because I believe a linear kernel will model what I think should be a linear decision boundary very well. The tuning selected a cost parameter (C) of 14 using 110 support vectors. Caret’s final model selected a constant sigma (variance) value of 1, which means that the model is considering only points close to the linear decision boundary established discussed here.
Polynomial kernel: The polynomial kernel fits a support vector classifier to a vector space with more dimensions than are present in the model, which “leads to a much more flexible decision boundary” (James, 2013). It does not seem to me like polynomial kernels would be useful here, because I observed a very linear relationship among a fixed set of RGB values when I checked multicollinearity at the beginning. I did train the vector machine, but it did not yield good results and took a very long time to run for degrees > 2, so I will move forward without it.
Radial Basis kernel: Radial kernel’s have very “local behavior” (James, 2013), meaning when points are separated by a large distance, the kernel value is zero. In other words, we use sigma (variance) as a tuning parameter to decide how simliar two observations really area. This seems like it would be useful for this dataset, given we might have groups of color palettes that might be classified relative to one another. Unfortunately, this algorithm takes a very long time to run. I performed model tuning once on a small sample of data. The results were strikingly similar to those of the linear kernel, implying that maybe the radial basis kernel is not actually doing anything to work around potential problems created by a linear decision boundary. Further inspection revealed that the kernel was almost always classifying tarps correctly (very high precision) when accuracy was high. In addition to making results hard to interpret, accuracy tailed off when trying to achieve a more reasonable precision/recall tradeoff. The best value for accuracy was 0.978, found at threshold of 0.93.
Therefore, moving into part 7, I will only use linearSVM for scoring.
My threshold at the beginning of step 5 was set somewhat arbitrarily; I made it a higher value (probability of true blue tarp is high) because I thought this would maximize sensitivity (which is more of a priority for me because of how few tarps there are).
In step 6, I got more method-specific thresholds and it seems like that conclusion wasn’t right; lower threshold values resulted in better accuracy scores. But I have the power to tune this based on assumptions and goals of the effort. I laid out the assumption at the beginning that recall is more important than precision. The question I am faced with now is how much more valuable is reducing the probability that I deliver to the wrong person than reducing the number of misses overall? If we assume that this trade-off can be represented as a fraction, I would recommend by 5 to 1. Here is how I derived this:
There are 2,022 blue tarps in this dataset. The earthquake covered some 70 miles in diameter link. For one person to travel 70 miles (considering there is devastation, all-terrain vehicles may be necessary), this may take around 5 hours.
If someone in a temporary shelter (blue tarp) was not actually injured, we may have lost 5 hours providing relief to someone who actually was injured, 70 miles away.
I arrived at a fraction of 1 hours vs. these 5 hours by making a few assumptions:
In other words, predicting a false discovery on the outskirts wastes more resources than would be effectively administered by applying to 5 tarps close by and only getting one right.
In reality, I don’t know anyone’s condition or the quality of other ground intelligence, so trying to layer in too many more assumptions about collective safety and security on the island is an overwhelmingly difficult task. The main takeaway for the audience is that ‘precision’ is a vague concept to me at this stage and so I assume precision comes at a very high (unspecified) cost.
As an example, a high-recall model (using best performer logistic regression with threshold equal to 0.1) has shown me that about 1,000 of 1,600 people receive relief. High recall in that case clearly came at the price of lost precision. I codified my thinking into a variant of the F1 score (the “harmonic mean of precision and recall” link, a more general F_beta score to plug in my weighting. The equation is \(Fβ=(1+β)^2*(precision*recall)/(β^2*precision)+recall\)
In my case, beta is 5. I automated the F_beta scoring using the same functions from part 6.
I will now compute this F-Score across data frames for all six methods using the thresholds determined in the step prior, using a test dataset which simulate new predictions. To do this, I comment in lines of code that calculate the score at the end of each function.
#> Recall Stdev_Recall Precision Stdev_Precision Accuracy Stdev_Accuracy
#> 1 0.8614202 0.04247603 0.9360901 0.02377837 0.99345 0.001832411
#> Specificity Stdev_Spec FDR Score Score_Dev threshold
#> 1 0.9979831 0.0007492199 0.06390989 5.920896 0.1800573 0.2
thresholding_qda(0.99) #qda
#> Recall Stdev_Recall Precision Stdev_Precision Accuracy Stdev_Accuracy
#> 1 0.9171868 0.03033954 0.9492208 0.02509369 0.9955999 0.001197435
#> Specificity Stdev_Spec FDR Score Score_Dev threshold
#> 1 0.9982933 0.0009140025 0.05077924 6.066853 0.1691574 0.99
thresholdinglog(0.2) #logistic
#> Recall Stdev_Recall Precision Stdev_Precision Accuracy Stdev_Accuracy
#> 1 0.9638554 0 0.9103841 0 0.99565 0
#> Specificity Stdev_Spec FDR Score Score_Dev threshold
#> 1 0.9967418 0 0.08961593 5.939872 NaN 0.2
thresholding_knn(0.4) #knn
#> Recall Stdev_Recall Precision Stdev_Precision Accuracy Stdev_Accuracy
#> 1 0.9518072 0 0.9518072 0 0.9968 0
#> Specificity Stdev_Spec FDR Score Score_Dev threshold
#> 1 0.9983451 0 0.04819277 6.129639 NaN 0.4
thresholdingrf(15) #with random forests with 15 trees
#> Recall Stdev_Recall Precision Stdev_Precision Accuracy Stdev_Accuracy
#> 1 0.9442108 0.0277446 0.9458325 0.03072823 0.9963001 0.0009481338
#> Specificity Stdev_Spec FDR Score Score_Dev Threshold
#> 1 0.9980865 0.001195524 0.05416753 6.088826 0.1935934 0.4
thresholdingsvm(0.4) #linear
#> line search fails -0.0001063527 4.111006 5.02076e-05 -8.383592e-09 -3.158583e-15 8.853931e-11 -9.008623e-19line search fails -0.0004873761 1.359138 0.0002473538 -1.83398e-08 -1.046668e-13 -3.643704e-11 -2.522147e-17line search fails -0.0005289002 1.752723 0.001319092 6.622307e-08 -1.077308e-12 -2.756996e-09 -1.603644e-15line search fails -0.0004627463 3.835506 0.0004160318 -1.679632e-08 -2.591412e-13 7.80814e-10 -1.209258e-16line search fails -0.000410455 2.731797 0.005035071 -2.493397e-07 -2.185465e-12 1.217192e-09 -1.130746e-14line search fails -0.0003990366 3.919587 1.841976e-05 -7.166521e-09 -1.624513e-14 1.0322e-10 -1.038959e-18
#> Error in if (!is.null(sorted.scores.class1) & (length(sorted.scores.class0) != : missing value where TRUE/FALSE needed
Initially, I mapped various thresholds again to compare and contrast average and standard deviation of the cross-fold F_beta scores and other accuracy metrics. In most cases, the result showed that standard deviation was large enough that it pushed different methods past one another in terms of performance (competition was close). Therefore, I always picked the best average performer, but noted the standard deviation when known in the table below.
| Method | KNN (k=5) | QDA | LDA | Logistic Regression | Support Vector (C=14 linear) | Random Forest (m=2, n=15) |
|---|---|---|---|---|---|---|
| Accuracy | 0.9968 | 0.9956 | 0.9934 | 0.9956 | 0.993 | 0.9963 |
| AUC (P/R Curve) | 0.956 | 0.958 | 0.960 | 0.954 | 0.956 | 0.957 |
| Threshold | 0.4 | 0.99 | 0.2 | 0.2 | 0.4 | 0.4 |
| Sensitivity/Recall/Power | 0.952 | 0.917 | 0.861 | 0.963 | 0.821 | 0.944 |
| Specificity (1-FPR) | 0.998 | 0.9983 | 0.998 | 0.997 | 0.998 | 0.998 +/-0.0008 |
| FDR (1-Precision) | 0.058 | 0.051 | 0.064 | 0.09 | 0.022 +/- 0.03 | 0.054 |
| Precision (PPV) | 0.952 | 0.949 | 0.936 | 0.91 | 0.978 +/- 0.03 | 0.945 |
| F_5 Score | 6.12 | 6.06 | 5.92 | 5.93 | 5.95 | 6.08 |
In part 1, KNN, support vector machines, and random forest seem like the front runners. However, as I will elaborate on in the conclusion, there is simply too much variability to assert that one of these will outperform others on even a comparable dataset. Logistic regressions performs consistently well (no clear “weak” metric), and in terms of my subjective F-Score weighting surpasses KNN.
Part 2 will center around how and why I would tune values for beta to improve my recall and/or precision based on a real world situation.
In part 2, we got hold of the geographic data that we wished for in part 1. Now that we have it, there is more information to digest and I can refine the approach. This is referred to as the “hold-out” or “held-out” dataset.
The first thing to do is read in and clean the hold-out files. I skipped the first seven rows which contained information about the files. There was one file that had a small sample of values which I did not use as it did not have lat/long values and was therefore potentially duplicated data. In the real world, I would clarify with the provider whether this was an error.
#https://stackoverflow.com/questions/9564489/read-all-files-in-a-folder-and-apply-a-function-to-each-data-frame
#https://lieselspangler.com/2017/11/16/how-to-load-and-append-multiple-files-in-r/
#https://stackoverflow.com/questions/16979858/reading-text-file-with-multiple-space-as-delimiter-in-r
bluefilenames <- list.files("~/Documents/MSDS/Data Mining SYS 6018/Hold+Out+Data/Blue", pattern="*.txt", full.names=TRUE)
blue <- bind_rows(lapply(bluefilenames,read_table,na="NA",skip=7))
nbfilenames <- list.files("~/Documents/MSDS/Data Mining SYS 6018/Hold+Out+Data/NB", pattern="*.txt", full.names=TRUE)
nonblue <- bind_rows(lapply(nbfilenames,read_table,na="NA",skip=7))I note from the datasets generated that my Red-Green-Blue values are now B1,B2, and B3 respectively. I will fix this using the statement below.
blue<- rename(blue, Red = B1,Green = B2, Blue = B3)
nonblue<- rename(nonblue, Red = B1,Green = B2, Blue = B3)I created coordinate-based data frames to summarize the set of blue tarps I am working with in terms of lat/long. I learned, again from the reactable package, that:
set.seed(40)
blue<-blue %>% mutate(lat2=round(Lat,digits=5)) %>% mutate(lon2=round(Lon,digits=5))
bluecoord<-data.frame(blue[12:13])
nonblue<-nonblue %>% mutate(lat2=round(Lat,digits=5)) %>% mutate(lon2=round(Lon,digits=5))
nonbluecoord<-data.frame(nonblue[12:13])
reactable(bluecoord, groupBy = "lat2")Inspecting the underlying data (which shows out to more decimals than reactable output does), I also observed that a lot of these images are indistinguishably close together; meaning multiple images probably make up one tarp. More than 5 decimals of lat/lon is probably more than needed, because 5 digits of precision is roughly 1 meter at the equator, and I do not imagine blue tarps are smaller than that (plus the practical limit of location is at play at this level of precision link)
I grouped the values based on that level of precision below; this makes for a much more manageable dataset moving forward.
blue<- blue %>% mutate(latlong=paste0(lat2,", ",lon2))
nonblue<-nonblue %>% mutate(latlong=paste0(lat2,", ",lon2))
blue_grouped<-aggregate(blue[,c(7:11)],list(blue$latlong),mean)
nonblue_grouped<-aggregate(nonblue[,c(7:11)],list(nonblue$latlong),mean)I was curious what the range of the palette of images in the hold-out dataset is, because this is going to tell me how much of a challenge my model will have to identify blue.
#https://stackoverflow.com/questions/25726276/visualize-a-list-of-colors-palette-in-r
image(1:nrow(all), 1, as.matrix(1:nrow(all)),
col=rgb(all$B1,all$B2,all$B3,maxColorValue = 255),
xlab="", ylab = "", xaxt = "n", yaxt = "n", bty = "n")
#> Error in 1:nrow(all): argument of length 0I can see a very distinct sky-blue line all the way in the far left, which is encouraging. But here is what blue looks like by itself before I grouped observations together.
#https://stackoverflow.com/questions/25726276/visualize-a-list-of-colors-palette-in-r
image(1:nrow(blue), 1, as.matrix(1:nrow(blue)),
col=rgb(blue$Red,blue$Green,blue$Blue,maxColorValue = 255),
xlab="", ylab = "", xaxt = "n", yaxt = "n", bty = "n")And here’s what they looked like afterwards (not much less granular):
#https://stackoverflow.com/questions/25726276/visualize-a-list-of-colors-palette-in-r
image(1:nrow(blue_grouped), 1, as.matrix(1:nrow(blue_grouped)),
col=rgb(blue_grouped$Red,blue_grouped$Green,blue_grouped$Blue,maxColorValue = 255),
xlab="", ylab = "", xaxt = "n", yaxt = "n", bty = "n")
image(1:nrow(nonblue_grouped), 1, as.matrix(1:nrow(nonblue_grouped)),
col=rgb(nonblue_grouped$Red,nonblue_grouped$Green,nonblue_grouped$Blue,maxColorValue = 255),
xlab="", ylab = "", xaxt = "n", yaxt = "n", bty = "n") The good news is I think I can move forward with confidence with my truncated datasets given how comparable the blue palette is to the original set. But it is clear the models have their work cut out for them in terms of classification.
Finally, I’m going to combine the two data frames (blue and non-blue), and shuffle that data.
blue_grouped<-blue_grouped %>% mutate(binary="Blue.Tarp")%>% mutate(colr=rgb(Red,Green,Blue,maxColorValue=255))
nonblue_grouped<-nonblue_grouped %>% mutate(binary="NonTarp")%>% mutate(colr=rgb(Red,Green,Blue,maxColorValue=255))
all<-rbind(nonblue_grouped,blue_grouped)
#shuffling data: https://medium.com/@tyagi.sudarshini/how-to-shuffle-a-dataframe-in-r-by-rows-e7971cd7949e
rows<-sample(nrow(all))
all<-all[rows,]So my total dataset has 12,051 observations. This is pretty close to the size of the original test set I was using, so I know my models will at least run efficiently. Using this dataset plus the lat/long boundary that I found using react table, I can plot the points using a nifty html streetmap via “mapview” package, which confirms my suspicion at the end of part one: there are distinct neighborhoods, three to be exact.
#https://www.r-spatial.org/r/2018/10/25/ggplot2-sf-2.html ; used this for understanding how to plot coordinates
my.sf.point <- st_as_sf(x = all,
coords = c("Lon", "Lat"),
crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
#https://stackoverflow.com/questions/29736577/how-to-convert-data-frame-to-spatial-coordinates
my.sp.point <- as(my.sf.point,"Spatial")
#mapview(my.sp.point,cex=2)I’m noting that there are tarps crossing a river in the left-most neighborhood. So there may be some bad data.
Here is another visualization using ggplot. I like this because it is static and therefore knits better and still provides a lot of context via the scale.
#https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html
#this required downloading quite a few packages; libwgeom was deprecated for Rv4, equivalent is lwgeom
world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
#> [1] "sf" "data.frame"
ggplot(data = world) +
geom_sf() +
theme(panel.grid.major = element_line(color = gray(.5), linetype = "dashed", size = 0.5), panel.background = element_rect(fill = "aliceblue") )+
scale_fill_viridis_c(option = "plasma", trans = "sqrt") +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "tr", which_north = "true",
pad_x = unit(0.75, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering) +
ggtitle("Where are the Blue Tarps Concentrated?") +
coord_sf(xlim = c(-72.44, -72.36), ylim = c(18.5, 18.6)) + #adjusted coordinates to show Haiti
geom_point(data=blue_grouped,aes(x=Lon,y=Lat),size=0.1,shape=23,fill=blue_grouped$colr) We can see the tarps are located in only two of the three neighborhoods. In a real simulation, we might not have figured this out (we wouldn’t have a “blue” and “non-blue” dataset at this point, we’d just have a dataset). But I will address this finding in part 10.
So I’ve done my visual analysis, and now I’m pretending that I’ve deployed resources to each neighborhood; I don’t have to worry about losing so much time by traversing the entire area like I was before. My original F_beta rated recall 5 times higher than precision because I was operating in a “prediction precision” mindset. If a new observation was not precise, I was losing assuming the maximum amount of time was lost for every miss. Now I can fine tune that approach by focusing on the two neighborhoods I see above.
In this exercise, I will establish a budget of USD 300 per administration of aid to one tarp, or USD 3.615 million for the whole effort.
For purposes of simulation I will pretend neighborhood 1 (east neighborhood) is much more dense with tarps (it at least appears to be so from images), and that there are separate teams in the east and west neighborhoods. I will allocate this team with a budget that reflects the number of tarps.
There are 2,577 tarps in neighborhood 1, for a total budget of 773,100.
The risk to the team working in this neighborhood is lower for false discovery, and therefore I’ve assigned a lower cost to false positives and a higher one to false negatives; this assumes that “misses” towards the epicenter will waste time in the event of a contingency when resources have to be spread out.
| \(Real Non-Tarp\) | \(Real Tarp\) | |
|---|---|---|
| \(Predicted Non-Tarp\) | 0 | -1500 |
| \(Predicted Tarp\) | -300 | -300 |
In contrast, neighborhood 2 is more spread out. The risk to the “spread out” team is higher for false discovery; this assumes that “misses” towards the perimeter will waste time and travel resources. Therefore, this team must act efficiently and precisely.
There are 9,474 tarps in neighborhood 1, for a total budget of USD 2,842,200.
| \(Real Non-Tarp\) | \(Real Tarp\) | |
|---|---|---|
| \(Predicted Non-Tarp\) | 0 | -900 |
| \(Predicted Tarp\) | -900 | -300 |
A goal I mentioned in the beginning is a fair comparison for how the methods selected “in the lab” perform against real world data when taken out of the box. Therefore, I haven’t varied my tuning parameters from the first part. But with each method I will derive an extra data point, which is the hypothetical budget depleted as a result of using each method. This is essentially an assessment of performance on a smaller scale and against two possibly very different datasets.
set.seed(23)
glm.fit2 <- caret::train(mod3, data=haiti,
preProcess=c("center","scale"),
trControl = caret::trainControl("cv", number=2, #k-folds cross-validation
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE),method="glm",family="binomial")
glm.predictionslog <- predict(glm.fit2, all, type="prob")
glm.predictionslog <-glm.predictionslog%>% mutate(.prediction=ifelse(glm.predictionslog$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmat_log<-caret::confusionMatrix(as.factor(glm.predictionslog$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmat_log
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 208 22
#> NonTarp 1 11820
#>
#> Accuracy : 0.9981
#> 95% CI : (0.9971, 0.9988)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9466
#>
#> Mcnemar's Test P-Value : 3.042e-05
#>
#> Sensitivity : 0.9981
#> Specificity : 0.9952
#> Pos Pred Value : 0.9999
#> Neg Pred Value : 0.9043
#> Precision : 0.9999
#> Recall : 0.9981
#> F1 : 0.9990
#> Prevalence : 0.9827
#> Detection Rate : 0.9808
#> Detection Prevalence : 0.9809
#> Balanced Accuracy : 0.9967
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionslog$Blue.Tarp<0.2,glm.predictionslog$Blue.Tarp>0.2)set.seed(94)
glm.predictionslog1 <- predict(glm.fit2, all[all$neighborhood==1,], type="prob")
glm.predictionslog1 <-glm.predictionslog1 %>% mutate(.prediction=ifelse(glm.predictionslog1$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmatlog1<-caret::confusionMatrix(as.factor(glm.predictionslog1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1<-confmatlog1$table[1,2]*1500+confmatlog1$table[2,1]*300+confmatlog1$table[2,2]*300
budget_remaining_log1<-team1budget-depleted1
budget_remaining_log1
#> [1] -2400
glm.predictionslog2 <- predict(glm.fit2, all[all$neighborhood==2,], type="prob")
glm.predictionslog2 <-glm.predictionslog2 %>% mutate(.prediction=ifelse(glm.predictionslog2$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmatlog2<-caret::confusionMatrix(as.factor(glm.predictionslog2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2<-confmatlog2$table[1,2]*900+confmatlog2$table[2,1]*900+confmatlog2$table[2,2]*300
budget_remaining_log2<-team2budget-depleted2
budget_remaining_log2
#> [1] 42600
budget_remaining_log<-budget_remaining_log1+budget_remaining_log2
budget_remaining_log
#> [1] 40200KNN does reasonably well on the new dataset, with an average accuracy of 0.993.
set.seed(7)
#K-Nearest Neighbors
glm.fit1 <- caret::train(mod3, data=haiti, method="knn",
preProcess=c("center","scale"),
tuneGrid=data.frame(k=5),
trControl = caret::trainControl("cv", number=10, #k-folds cross-validation
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE))
glm.predictionsknn <- predict(glm.fit1, all, type="prob")
glm.predictionsknn <-glm.predictionsknn %>% mutate(.prediction=ifelse(glm.predictionsknn$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmat_knn<-caret::confusionMatrix(as.factor(glm.predictionsknn$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmat_knn
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 164 44
#> NonTarp 45 11798
#>
#> Accuracy : 0.9926
#> 95% CI : (0.9909, 0.9941)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.7828
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9963
#> Specificity : 0.7847
#> Pos Pred Value : 0.9962
#> Neg Pred Value : 0.7885
#> Precision : 0.9962
#> Recall : 0.9963
#> F1 : 0.9962
#> Prevalence : 0.9827
#> Detection Rate : 0.9790
#> Detection Prevalence : 0.9827
#> Balanced Accuracy : 0.8905
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionsknn$Blue.Tarp<0.4,glm.predictionsknn$Blue.Tarp>0.4)set.seed(96)
glm.predictionsknn1 <- predict(glm.fit1, all[all$neighborhood==1,], type="prob")
glm.predictionsknn1 <-glm.predictionsknn1 %>% mutate(.prediction=ifelse(glm.predictionsknn1$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatknn1<-caret::confusionMatrix(as.factor(glm.predictionsknn1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1knn<-confmatknn1$table[1,2]*1500+confmatknn1$table[2,1]*300+confmatknn1$table[2,2]*300
budget_remaining_knn1<-team1budget-depleted1knn
budget_remaining_knn1
#> [1] -24900
glm.predictionsknn2 <- predict(glm.fit1, all[all$neighborhood==2,], type="prob")
glm.predictionsknn2 <-glm.predictionsknn2 %>% mutate(.prediction=ifelse(glm.predictionsknn2$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatknn2<-caret::confusionMatrix(as.factor(glm.predictionsknn2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2knn<-confmatknn2$table[1,2]*900+confmatknn2$table[2,1]*900+confmatknn2$table[2,2]*300
budget_remaining_knn2<-team2budget-depleted2knn
budget_remaining_knn2
#> [1] 12300
budget_remaining_knn<-budget_remaining_knn1+budget_remaining_knn2
budget_remaining_knn
#> [1] -12600set.seed(5)
# LDA
glm.fit3<-caret::train(mod3, data=haiti, method="lda",
preProcess=c("center","scale"),
trControl=trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE)) #10 folds cross-validation
glm.predictionslda <- predict(glm.fit3, all, type="prob")
glm.predictionslda <-glm.predictionslda %>% mutate(.prediction=ifelse(glm.predictionslda$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmatlda<-caret::confusionMatrix(as.factor(glm.predictionslda$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmatlda
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 206 22
#> NonTarp 3 11820
#>
#> Accuracy : 0.9979
#> 95% CI : (0.9969, 0.9987)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9417
#>
#> Mcnemar's Test P-Value : 0.0003182
#>
#> Sensitivity : 0.9981
#> Specificity : 0.9856
#> Pos Pred Value : 0.9997
#> Neg Pred Value : 0.9035
#> Precision : 0.9997
#> Recall : 0.9981
#> F1 : 0.9989
#> Prevalence : 0.9827
#> Detection Rate : 0.9808
#> Detection Prevalence : 0.9811
#> Balanced Accuracy : 0.9919
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionslda$Blue.Tarp<0.2,glm.predictionslda$Blue.Tarp>0.2)
#QDA
glm.fit4<-caret::train(mod3, data=haiti, method="qda",
preProcess=c("center","scale"),
trControl=trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE)) #10 folds cross-validation
glm.predictionsqda <- predict(glm.fit4, all, type="prob")
glm.predictionsqda <-glm.predictionsqda %>% mutate(.prediction=ifelse(glm.predictionsqda$Blue.Tarp>0.99, "Blue.Tarp", "NonTarp"))
confmatqda<-caret::confusionMatrix(as.factor(glm.predictionsqda$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmatqda
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 152 20
#> NonTarp 57 11822
#>
#> Accuracy : 0.9936
#> 95% CI : (0.992, 0.995)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.7947
#>
#> Mcnemar's Test P-Value : 4.086e-05
#>
#> Sensitivity : 0.9983
#> Specificity : 0.7273
#> Pos Pred Value : 0.9952
#> Neg Pred Value : 0.8837
#> Precision : 0.9952
#> Recall : 0.9983
#> F1 : 0.9968
#> Prevalence : 0.9827
#> Detection Rate : 0.9810
#> Detection Prevalence : 0.9857
#> Balanced Accuracy : 0.8628
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionsqda$Blue.Tarp<0.99,glm.predictionsqda$Blue.Tarp>0.99)set.seed(97)
glm.predictionslda1 <- predict(glm.fit3, all[all$neighborhood==1,], type="prob")
glm.predictionslda1 <-glm.predictionslda1 %>% mutate(.prediction=ifelse(glm.predictionslda1$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmatlda1<-caret::confusionMatrix(as.factor(glm.predictionslda1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1lda<-confmatlda1$table[1,2]*1500+confmatlda1$table[2,1]*300+confmatlda1$table[2,2]*300
budget_remaining_lda1<-team1budget-depleted1lda
budget_remaining_lda1
#> [1] 4800
glm.predictionslda2 <- predict(glm.fit1, all[all$neighborhood==2,], type="prob")
glm.predictionslda2 <-glm.predictionslda2 %>% mutate(.prediction=ifelse(glm.predictionslda2$Blue.Tarp>0.2, "Blue.Tarp", "NonTarp"))
confmatlda2<-caret::confusionMatrix(as.factor(glm.predictionslda2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2lda<-confmatlda2$table[1,2]*900+confmatlda2$table[2,1]*900+confmatlda2$table[2,2]*300
budget_remaining_lda2<-team2budget-depleted2lda
budget_remaining_lda2
#> [1] 18600
budget_remaining_lda<-budget_remaining_lda1+budget_remaining_lda2
budget_remaining_lda
#> [1] 23400
glm.predictionsqda1 <- predict(glm.fit3, all[all$neighborhood==1,], type="prob")
glm.predictionsqda1 <-glm.predictionsqda1 %>% mutate(.prediction=ifelse(glm.predictionsqda1$Blue.Tarp>0.99, "Blue.Tarp", "NonTarp"))
confmatqda1<-caret::confusionMatrix(as.factor(glm.predictionsqda1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1qda<-confmatqda1$table[1,2]*1500+confmatqda1$table[2,1]*300+confmatqda1$table[2,2]*300
budget_remaining_qda1<-team1budget-depleted1qda
budget_remaining_qda1
#> [1] 14400
glm.predictionsqda2 <- predict(glm.fit1, all[all$neighborhood==2,], type="prob")
glm.predictionsqda2 <-glm.predictionsqda2 %>% mutate(.prediction=ifelse(glm.predictionsqda2$Blue.Tarp>0.99, "Blue.Tarp", "NonTarp"))
confmatqda2<-caret::confusionMatrix(as.factor(glm.predictionsqda2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2qda<-confmatqda2$table[1,2]*900+confmatqda2$table[2,1]*900+confmatqda2$table[2,2]*300
budget_remaining_qda2<-team2budget-depleted2qda
budget_remaining_qda2
#> [1] -5100
budget_remaining_qda<-budget_remaining_qda1+budget_remaining_qda2
budget_remaining_qda
#> [1] 9300set.seed(3)
glm.fit5 <- caret::train(mod3, data=haiti, method='rf',
importance=FALSE, ntree=15,
tuneGrid=data.frame(mtry=2), #all features are selected for bagging
trControl=caret::trainControl("cv", number=10,
savePredictions=TRUE,
returnResamp='all',classProbs=TRUE,allowParallel = TRUE))
glm.predictionsrf <- predict(glm.fit5, all, type="prob")
glm.predictionsrf <-glm.predictionsrf %>% mutate(.prediction=ifelse(glm.predictionsrf$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatrf<-caret::confusionMatrix(as.factor(glm.predictionsrf$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmatrf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 157 34
#> NonTarp 52 11808
#>
#> Accuracy : 0.9929
#> 95% CI : (0.9912, 0.9943)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.7814
#>
#> Mcnemar's Test P-Value : 0.06678
#>
#> Sensitivity : 0.9971
#> Specificity : 0.7512
#> Pos Pred Value : 0.9956
#> Neg Pred Value : 0.8220
#> Precision : 0.9956
#> Recall : 0.9971
#> F1 : 0.9964
#> Prevalence : 0.9827
#> Detection Rate : 0.9798
#> Detection Prevalence : 0.9842
#> Balanced Accuracy : 0.8742
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionsrf$Blue.Tarp<0.4,glm.predictionsrf$Blue.Tarp>0.4) ### Neighborhood Level Performance
set.seed(98)
glm.predictionsrf1 <- predict(glm.fit5, all[all$neighborhood==1,], type="prob")
glm.predictionsrf1 <-glm.predictionsrf1 %>% mutate(.prediction=ifelse(glm.predictionsrf1$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatrf1<-caret::confusionMatrix(as.factor(glm.predictionsrf1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1rf<-confmatrf1$table[1,2]*900+confmatrf1$table[2,1]*900+confmatrf1$table[2,2]*300
budget_remaining_rf1<-team1budget-depleted1rf
budget_remaining_rf1
#> [1] -14700
glm.predictionsrf2 <- predict(glm.fit5, all[all$neighborhood==2,], type="prob")
glm.predictionsrf2 <-glm.predictionsrf2 %>% mutate(.prediction=ifelse(glm.predictionsrf2$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatrf2<-caret::confusionMatrix(as.factor(glm.predictionsrf2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2rf<-confmatrf2$table[1,2]*900+confmatrf2$table[2,1]*900+confmatrf2$table[2,2]*300
budget_remaining_rf2<-team2budget-depleted2rf
budget_remaining_rf2
#> [1] 10200
budget_remaining_rf<-budget_remaining_rf1+budget_remaining_rf2
budget_remaining_rf
#> [1] -4500set.seed(4)
caret.svm.linear <- caret::train(mod3,data=haiti, method='svmLinear', scale=FALSE,
tuneGrid=data.frame(C=14),
trControl=caret::trainControl("cv", number=10,
returnResamp='all', savePredictions='final',
classProbs=TRUE,allowParallel=TRUE))
#> line search fails -3.192122e-05 3.901076 0.02493286 -4.742438e-07 -1.876814e-14 1.431243e-09 -1.146701e-15line search fails -2.707928e-05 3.870037 7.623979e-05 -3.95171e-09 -7.826447e-17 7.980098e-12 -3.75019e-20line search fails -7.620795e-05 3.901636 0.0002531813 -1.118458e-08 -1.541345e-15 4.96214e-11 -9.452341e-19line search fails -0.0001922799 3.158826 0.006856271 -1.028806e-07 -1.963767e-13 6.771668e-10 -1.416079e-15line search fails -0.0001767447 4.775168 0.001012516 -6.404229e-08 -3.371017e-14 5.22493e-10 -6.759374e-17
glm.predictionssvm <- predict(caret.svm.linear, all, type="prob")
glm.predictionssvm <-glm.predictionssvm %>% mutate(.prediction=ifelse(glm.predictionssvm$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatsvm<-caret::confusionMatrix(as.factor(glm.predictionssvm$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmatsvm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 209 1220
#> NonTarp 0 10622
#>
#> Accuracy : 0.8988
#> 95% CI : (0.8932, 0.9041)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2319
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.8970
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.1463
#> Precision : 1.0000
#> Recall : 0.8970
#> F1 : 0.9457
#> Prevalence : 0.9827
#> Detection Rate : 0.8814
#> Detection Prevalence : 0.8814
#> Balanced Accuracy : 0.9485
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionssvm$Blue.Tarp<0.4,glm.predictionssvm$Blue.Tarp>0.4) glm.predictionssvm1 <- predict(caret.svm.linear, all[all$neighborhood==1,], type="prob")
glm.predictionssvm1 <-glm.predictionssvm1 %>% mutate(.prediction=ifelse(glm.predictionssvm1$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatsvm1<-caret::confusionMatrix(as.factor(glm.predictionssvm1$.prediction), as.factor(all$binary[all$neighborhood==1]),positive="NonTarp",mode="everything")
depleted1svm<-confmatsvm1$table[1,2]*900+confmatsvm1$table[2,1]*900+confmatsvm1$table[2,2]*300
budget_remaining_svm1<-team1budget-depleted1svm
budget_remaining_svm1
#> [1] -170400
glm.predictionssvm2 <- predict(glm.fit5, all[all$neighborhood==2,], type="prob")
glm.predictionssvm2 <-glm.predictionssvm2 %>% mutate(.prediction=ifelse(glm.predictionssvm2$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatsvm2<-caret::confusionMatrix(as.factor(glm.predictionssvm2$.prediction), as.factor(all$binary[all$neighborhood==2]),positive="NonTarp",mode="everything")
depleted2svm<-confmatsvm2$table[1,2]*900+confmatsvm2$table[2,1]*900+confmatsvm2$table[2,2]*300
budget_remaining_svm2<-team2budget-depleted2svm
budget_remaining_svm2
#> [1] 10200
budget_remaining_svm<-budget_remaining_svm1+budget_remaining_svm2
budget_remaining_svm
#> [1] -160200Below is the final result from the held-out data.
| Method | KNN(k=5) | QDA | LDA | Logistic Regression | Support Vector (C=14 linear) | Random Forest (m=2, n=15) |
|---|---|---|---|---|---|---|
| Accuracy | 0.9926 | 0.9991 | 0.9979 | 0.9988 | 0.898 | 0.9929 |
| AUC (P/R Curve) | 0.975 | 0.981 | 0.975 | 0.975 | 0.853 | 0.978 |
| Threshold | 0.4 | 0.99 | 0.2 | 0.2 | 0.4 | 0.4 |
| Sensitivity/Recall/Power | 0.9963 | 09983 | 0.9981 | 0.9981 | 0.8970 | 0.9971 |
| Specificity (1-FPR) | 0.7847 | 0.9474 | 0.9856 | 0.9952 | 1 | 0.7512 |
| FDR (1-Precision) | 0.0038 | 0.009 | 0.0003 | 0.0004 | 0.0038 | 0.059 |
| Precision (PPV) | 0.9962 | 0.991 | 0.9997 | 0.9999 | 1 | 0.9956 |
| Budget Remaining | -12,600 | 9,300 | 23,400 | 42,300 | -160,200 | -4,500 |
The first observation I make is that there were again very many “close calls” across the statistical learning methods. In fact, there were almost none of the same “winners” as in part 1 using 10-folds cross-validation.
The context of sampling variability is still important; while I only applied one hold-out set so there is no need to consider cross-validation at this step, the variability tolerated by each trained model is important to remember. Across methods, the average standard deviation looked like this for the metrics we are interested in:
| \(Average St | | | Deviation\) | |
|---|---|
| Accuracy | 0.002 |
| Recall | 0.004 |
| Specificity | 0.007 |
| Precision | 0.0001 |
Given the margins between methods, the metrics must all be taken with some discretion. However, there are some clear takeaways: Logistic Regression performs well across the highest number of metrics, and appears to be robust enough to provide both high accuracy and stay within budget for this venture. I imagine that this will consistently perform well against new data and will test that theory below. Support Vector Machines produce very high precision and specificity, meaning this is what I may want to use in an emergency situation when there is little time left or resources are running scarce and precision becomes key. Quadraditc discriminant analysis is now a consistently high performer in terms of accuracy while maintaining high metrics in recall and precision and staying within budget. This was originally perplexing to me because again, I do not necessarily think there would be a quadratic decision boundary. However, given the palette of colors we are working with, it may be that this algorithm is more accurately handling the range of colors. I did notice a slight divergence from a linear relationship in the “pairs” plots at the beginning for high “blue” color values. Random Forest did not stand out from the pack like I expected it might. I suspect that this has to do with over-fitting to the training dataset. I used the code below to vary these a bit for random forest with the new predictions, and found I could get marginally (though, given the comment about variability, not significantly) better results. This is a lesson in spending more time considering aspects of how flexible the model needs to be
glm.fit5_alt <- caret::train(mod3, data=haiti, method='rf',
importance=FALSE, ntree=25,
tuneGrid=data.frame(mtry=5), #all features are selected for bagging
trControl=caret::trainControl("cv", number=10,
savePredictions=TRUE,
returnResamp='all',classProbs=TRUE,allowParallel = TRUE))
glm.predictionsrf <- predict(glm.fit5_alt, all, type="prob")
glm.predictionsrf <-glm.predictionsrf %>% mutate(.prediction=ifelse(glm.predictionsrf$Blue.Tarp>0.4, "Blue.Tarp", "NonTarp"))
confmatrf<-caret::confusionMatrix(as.factor(glm.predictionsrf$.prediction), as.factor(all$binary),positive="NonTarp",mode="everything")
confmatrf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Blue.Tarp NonTarp
#> Blue.Tarp 173 69
#> NonTarp 36 11773
#>
#> Accuracy : 0.9913
#> 95% CI : (0.9895, 0.9929)
#> No Information Rate : 0.9827
#> P-Value [Acc > NIR] : 8.19e-16
#>
#> Kappa : 0.7628
#>
#> Mcnemar's Test P-Value : 0.001791
#>
#> Sensitivity : 0.9942
#> Specificity : 0.8278
#> Pos Pred Value : 0.9970
#> Neg Pred Value : 0.7149
#> Precision : 0.9970
#> Recall : 0.9942
#> F1 : 0.9956
#> Prevalence : 0.9827
#> Detection Rate : 0.9769
#> Detection Prevalence : 0.9799
#> Balanced Accuracy : 0.9110
#>
#> 'Positive' Class : NonTarp
#>
curve(glm.predictionsrf$Blue.Tarp<0.4,glm.predictionsrf$Blue.Tarp>0.4)One can imagine a higher-stakes second phase of this effort, given that only one method recalled all of the blue tarps while staying in budget.
For this effort, I imagine that the cost of both misses and false positives is equal (USD 2000) in order to drive both precision and recall. My theory is that logistic regression will work well again, so I tested this against the neighborhood 2 observations missed using the new constraints.
alln1<-all[all$neighborhood==1,] %>% mutate(logmisses1=data.frame(glm.predictionslog1$.prediction!=all$binary[all$neighborhood==1]))
glm.predictionslogmisses <- predict(glm.fit2, alln1[logmisses1==TRUE,], data=all,type="prob")
#> Error in `[.data.frame`(alln1, logmisses1 == TRUE, ): object 'logmisses1' not found
glm.predictionslogmisses <-glm.predictionslogmisses%>% mutate(.prediction=ifelse(glm.predictionslogmisses$Blue.Tarp>0.99, "Blue.Tarp", "NonTarp"))
#> Error in eval(lhs, parent, parent): object 'glm.predictionslogmisses' not found
confmatlogmisses<-caret::confusionMatrix(as.factor(glm.predictionslogmisses$.prediction), factor(alln1$binary[logmisses1==TRUE],levels=c('Blue.Tarp','NonTarp')),positive="NonTarp",mode="everything")
#> Error in is.factor(x): object 'glm.predictionslogmisses' not found
depletedsecondpass<-confmatlogmisses$table[1,2]*2000+confmatlogmisses$table[2,1]*2000+confmatlogmisses$table[2,2]*500
#> Error in eval(expr, envir, enclos): object 'confmatlogmisses' not found
budget_remaining_log1<-43000-depletedsecondpass
#> Error in eval(expr, envir, enclos): object 'depletedsecondpass' not found
budget_remaining_log1
#> [1] -2400As predicted, logistic regression performs well (using a new threshold parameter found by re-running the fit). As it turned out, there were no remaining tarps in neighborhood one after the first pass, but the performance of the algorithm lends itself to high precision with this many observations remaining, so the budget doesn’t get depleted. One could continue using this approach, retuning costs (thereby essentially retuning F_beta score) as the mission evolves. I will discuss this a little more in my final remarks.
In module 3.3 of the SYS 6018 course, we compared classification methods K-Nearest Neighbors, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. According to the text, the best performing algorithm will usually be the one whose assumptions best match the data, and that performance will be poor if the model makes assumptions that are not represented in the data. Logistic regression assumes that log-odds (an expression of probability of an event occurring relative to not occurring) of the predictors are useful. KNN offers flexible parameters as it makes no assumptions. LDA and QDA assume that all classes have Gaussian distributions.
The data seem like they would be well suited to linear discriminant analysis, because there should be a clear linear boundary between complementary color classes (not quadratic) and they should have Gaussian distributions given their real world frequency. However, we can see below that is not clear from this dataset that all of the predictors have the same covariance.
cov(haiti$Blue,haiti$Green)
#> [1] 4302.854
cov(haiti$Red,haiti$Green)
#> [1] 5624.511
cov(haiti$Red,haiti$Blue)
#> [1] 4530.342I believe this is the result of unequal representation of red, green, and blue pixels in the imagery, skewed away from blue. This makes it easier to describe the response variable as a binary one, making it a binary classification problem, and either logistic regression or knn are very well suited for that.
One item that crossed my mind at the beginning of the effort is whether binary logistic regression is appropriate given that we have more than just binary data. I understand from online reading that multinomial logistic regression is a technique which can be used for clearer modeling between 3 or more classes. I think this could help us understand other structures a little better. The decision boundary could also be visualized on a plot across all classes (like it was in Module 3), to clearly show how confident we are at each class boundary in any of our methods (that was a powerful tool that I tried to find some code to mimic but it was taking a very long time even to test).
Additionally, dimension reduction techniques like principal components analysis could help with finding even more informed probabilities when predicting new observations. I am not as equipped at what to do with the principal components model, i.e., how to plug it back in for testing and interpret the components, but I know we can compute it using the following technique:
set.seed(2)
#http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/152-principal-component-and-partial-least-squares-regression-essentials/
#https://rstudio-pubs-static.s3.amazonaws.com/579414_4aea8984cf1d47abac389fde7eb6d24d.html
pcr.fits <- pcr(Class~., data=haiti.train,scale=TRUE,validation="CV")Another improvement would be to evaluate runtimes of the algorithms used here. I found that, generally speaking, support vector machines took longer to run even with multi-core threading and held-constant parameters. There may be more efficient and/or effective kernel functions out there than the methods I am plugging into caret
One could also do a lot more with visualizations. For example. I originally intended to use the color palettes from the rbg() function in coloring data points, as a potential anomaly detection measure for weeding out blue tarps that were actually bodies of water (which I suspect is what happened to the “neighborhood” with no blue tarps.
This analysis could be repeated using Bayesian machine learning techniques, which I am a lot less qualified to discuss. Given that this was a fairly widespread trove of images in our hold out dataset, we may now be able to apply the model with the lowest standard deviation in testing and highest resistance to overfitting new observations, random-foreset, during other relief efforts in the future. THe objects that we are classifying (soil, for example) should be quite common, though the landscape may differ a bit. This strikes me as especially important, because disasters of this scale do not occur everyday.
My colleague Shilpa Narayan discussed at length why the “F_beta” score method could be used widely from a risk management standpoint. It was Shilpa’s discussion that led me to consider different costs for different neighborhoods. While I used cost as a means of quantifying risk in part 2, with more unknowns this method could be a tool for simulating more specific risk-opportunity scenarios. With more time, I would have liked to create more matricies to do such modeling.
I credit my UVa Data Science cohort and Professor Schwartz with a lot of support on the journey that led to this document during Fall 2020.